Training a PyTorch Transformer Model


Sample Data Preparation

import torch
import torch.nn as nn
import torch.optim as optim

# The Transformer class used below is assumed to be the model class defined in the earlier parts of this series.

'''
Hyperparameters:
These values define the architecture and behavior of the transformer model:
src_vocab_size, tgt_vocab_size: Vocabulary sizes for source and target sequences, both set to 5000.
d_model: Dimensionality of the model's embeddings, set to 512.
num_heads: Number of attention heads in the multi-head attention mechanism, set to 8.
num_layers: Number of layers for both the encoder and the decoder, set to 6.
d_ff: Dimensionality of the inner layer in the feed-forward network, set to 2048.
max_seq_length: Maximum sequence length for positional encoding, set to 100.
dropout: Dropout rate for regularization, set to 0.1.
'''
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

'''
This line creates an instance of the Transformer class, initializing it with the given
hyperparameters. The instance will have the architecture and behavior defined by these 
hyperparameters.
'''
transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

'''
Generate random sample data.
src_data: Random integers from 1 to src_vocab_size - 1, representing a batch of source sequences with shape (64, max_seq_length).
tgt_data: Random integers from 1 to tgt_vocab_size - 1, representing a batch of target sequences with shape (64, max_seq_length).
These random sequences can be used as inputs to the transformer model, simulating a batch of 64 examples with sequences of length 100.
'''
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

This code snippet demonstrates how to initialize the Transformer model and generate random source and target sequences that can be fed into it. The chosen hyperparameters determine the specific structure and properties of the transformer. This setup can be part of a larger script in which the model is trained and evaluated on a real sequence-to-sequence task such as machine translation or text summarization.
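
The random tensors above are only a stand-in for real data. As a rough sketch of how real tokenized sentences could be brought into the same (batch_size, max_seq_length) shape, the snippet below pads variable-length lists of token ids, assuming id 0 is reserved for padding (consistent with the ignore_index=0 used by the loss function later). The pad_batch helper and the token ids are illustrative, not part of the tutorial's code.

def pad_batch(sequences, max_len, pad_id=0):
    # Right-pad each list of token ids with pad_id and stack into a (batch, max_len) LongTensor.
    batch = torch.full((len(sequences), max_len), pad_id, dtype=torch.long)
    for i, seq in enumerate(sequences):
        length = min(len(seq), max_len)
        batch[i, :length] = torch.tensor(seq[:length], dtype=torch.long)
    return batch

# Placeholder token ids standing in for tokenizer output.
real_src = pad_batch([[12, 345, 67], [89, 4, 501, 23, 7]], max_seq_length)
real_tgt = pad_batch([[5, 678, 90, 2], [31, 415, 9]], max_seq_length)
print(real_src.shape, real_tgt.shape)  # torch.Size([2, 100]) torch.Size([2, 100])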

Training the Model

'''
criterion = nn.CrossEntropyLoss(ignore_index=0): Defines the loss function as cross-entropy loss. The ignore_index argument is set to 0, meaning the loss will not consider targets with an index of 0 (typically reserved for padding tokens).
optimizer = optim.Adam(...): Defines the optimizer as Adam with a learning rate of 0.0001 and specific beta values.
'''
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

'''
transformer.train(): Sets the transformer model to training mode, enabling behaviors like dropout that only apply during training.
'''
transformer.train()

'''
The code snippet trains the model for 100 epochs using a typical training loop:

for epoch in range(100): Iterates over 100 training epochs.
optimizer.zero_grad(): Clears the gradients from the previous iteration.
output = transformer(src_data, tgt_data[:, :-1]): Passes the source data and the target data (excluding the last token in each sequence) through the transformer. This is common in sequence-to-sequence tasks where the target is shifted by one token.
loss = criterion(...): Computes the loss between the model's predictions and the target data (excluding the first token in each sequence). The loss is calculated by reshaping the data into one-dimensional tensors and using the cross-entropy loss function.
loss.backward(): Computes the gradients of the loss with respect to the model's parameters.
optimizer.step(): Updates the model's parameters using the computed gradients.
print(f"Epoch: {epoch+1}, Loss: {loss.item()}"): Prints the current epoch number and the loss value for that epoch.
'''
for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

This code snippet trains the Transformer model for 100 epochs on the randomly generated source and target sequences, using the Adam optimizer and the cross-entropy loss function. The loss is printed at every epoch so you can monitor training progress. In a real scenario, you would replace the random source and target sequences with actual data from your task, such as machine translation.
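
The loop above pushes the entire 64-example batch through the model at every step. When you switch to real data, you would typically iterate over shuffled mini-batches instead; the sketch below does this with TensorDataset and DataLoader from torch.utils.data. The batch size of 16 and the single pass over the data are illustrative choices, not values from the tutorial.

from torch.utils.data import TensorDataset, DataLoader

# src_data / tgt_data can be the random tensors above or real tokenized tensors of the same shape.
dataset = TensorDataset(src_data, tgt_data)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

transformer.train()
for src_batch, tgt_batch in loader:
    optimizer.zero_grad()
    output = transformer(src_batch, tgt_batch[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size),
                     tgt_batch[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()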

Evaluating Model Performance

'''
transformer.eval(): Puts the transformer model in evaluation mode. This is important because
it turns off certain behaviors like dropout that are only used during training.
'''
transformer.eval()

'''
Generate random sample validation data.
val_src_data: Random integers from 1 to src_vocab_size - 1, representing a batch of validation source sequences with shape (64, max_seq_length).
val_tgt_data: Random integers from 1 to tgt_vocab_size - 1, representing a batch of validation target sequences with shape (64, max_seq_length).
'''
val_src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
val_tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

'''
Validation Loop:

with torch.no_grad(): Disables gradient computation, as we don't need to compute gradients during validation. This can reduce memory consumption and speed up computations.
val_output = transformer(val_src_data, val_tgt_data[:, :-1]): Passes the validation source data and the validation target data (excluding the last token in each sequence) through the transformer.
val_loss = criterion(...): Computes the loss between the model's predictions and the validation target data (excluding the first token in each sequence). The loss is calculated by reshaping the data into one-dimensional tensors and using the previously defined cross-entropy loss function.
print(f"Validation Loss: {val_loss.item()}"): Prints the validation loss value.
'''
with torch.no_grad():
    val_output = transformer(val_src_data, val_tgt_data[:, :-1])
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size), val_tgt_data[:, 1:].contiguous().view(-1))
    print(f"Validation Loss: {val_loss.item()}")

This code snippet evaluates the Transformer model on a randomly generated validation dataset, computing and printing the validation loss. In a real scenario, the random validation data should be replaced with actual validation data from the task you are working on. The validation loss gives you an idea of how well the model performs on unseen data, which is a key measure of its ability to generalize.
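
Beyond tracking the validation loss, you will eventually want to watch the model generate sequences on its own. The sketch below shows simple greedy decoding; it assumes the model's forward pass takes (src, tgt) and returns logits of shape (batch, tgt_len, tgt_vocab_size), as in the code above, and it uses a hypothetical start-of-sequence token id (sos_id = 1). Adjust the special token ids to match your own vocabulary.

def greedy_decode(model, src, max_len=20, sos_id=1):
    # Start every sequence with the (assumed) start-of-sequence token.
    model.eval()
    generated = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_len - 1):
            logits = model(src, generated)                     # (batch, cur_len, tgt_vocab_size)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
            generated = torch.cat([generated, next_token], dim=1)
    return generated

sample = greedy_decode(transformer, val_src_data[:4])
print(sample.shape)  # torch.Size([4, 20])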

